Abstract. We present a Bayesian method for the identification and classification of objects from
sets of astronomical catalogs, given a predefined classification scheme. Identification refers here
to the association of entries in different catalogs with a single object, and classification refers to
the matching of the associated data set to a model selected from a set of parametrized models
of different complexity. By virtue of Bayes' theorem, we can combine both tasks in an efficient
way, which allows a largely automated yet reliable generation of classified astronomical
catalogs. A problem for the Bayesian approach is the handling of exceptions, for which no
likelihoods can be specified. We present and discuss a simple and practical solution to this problem,
emphasizing the role of the "evidence" term in Bayes' theorem for the identification of exceptions.
Comparing the practice and logic of Bayesian classification to Bayesian inference, we finally note
some interesting links to concepts of the philosophy of science.
INTRODUCTION

The identification and classification of objects from a set of independent catalogs is
a key task for making astronomical data usable for scientific analysis. The standard
approach is to solve this problem step by step using hierarchical "best-match"
algorithms, as exemplified in the cross-identification of radio sources [1] from
the VizieR database of astronomical catalogs [2]. Although such algorithms are fast and
efficient in low-level applications, they have limitations in dealing with ambiguities and
in considering object classes with different levels of complexity. This is illustrated by the
recent production of the band-merged version [3] of the Planck Early Release Compact
Source Catalog (ERCSC) [4], and the variability classification of ERCSC objects using
WMAP data [5].

Motivated by this, we present here a Bayesian approach to object identification and 
classification, based on data from a set of astronomical catalogs taken, e.g., at different 
frequencies or by different observatories. In this method, we consider not only positional 
coincidence between catalog entries, but also the properties of known object classes, and 
use both as criteria for the identification and simultaneous classification of objects. The 
current paper focuses on the mathematical basics of this method, with some typical 
choices for priors and likelihoods needed for catalog generation. Applications to data, 
and a more detailed comparison with standard approaches will follow in future work. 
BAYESIAN ASSOCIATION AND CLASSIFICATION



Terms and Definitions 

Notational Conventions. We understand probability $\mathcal{P}$ in the Bayesian sense as an
operator which assigns a value of plausibility, $0 < \mathcal{P}(S) < 1$, to a statement $S$, and
we introduce the information Hamiltonian [6] for a condition $X$ on $S$ as
$\mathcal{H}(X|S) = -\log\mathcal{P}(S|X)$. Data (factual information) shall be denoted by blackboard-bold symbols
(e.g., $\mathbb{D}$), models (abstract beliefs) by calligraphic symbols (e.g., $\mathcal{M}$). We denote a set of
logically independent statements as $\{S_i\}_\perp$ and define for any condition $X$

$\mathcal{P}(\{S_i\}_\perp|X) = \prod_i \mathcal{P}(S_i|X)$ . (1)

A set of mutually exclusive statements shall be denoted $\{S_i\}_\wr$, with the definition

$\mathcal{P}(\{S_i\}_\wr|X) = \sum_i \mathcal{P}(S_i|X)$ . (2)

If a set of mutually exclusive statements is exhaustive, we call it a complete set of
alternatives $\{\{S_i\}\}_\wr$, with $\mathcal{P}(\{\{S_i\}\}_\wr|X) = 1$. The operator $N(\{S_i\})$ gives the number of
elements of a set with $i > 0$; a set containing a zero-indexed element is denoted $\{S_0, S_i\}$.

Structure of Data and Associations. From a set of positions taken from a highly
reliable seed catalog we select within a radius $\Delta_j$ potentially associated data $\mathbb{D} = \{\mathbb{D}_j\}_\perp$
from $N(\mathbb{D})$ independent target catalogs, where the seed catalog may be included as the
zero-indexed element, $\mathbb{D}_0$. The entries of each target catalog $j$ form a complete set of
alternatives, $\mathbb{D}_j = \{\{\mathbb{D}_{j0}, \mathbb{D}_{jk}\}\}_\wr$, where $\mathbb{D}_{j0} = \{\sigma_{j0}, \nu_j\}$ stands for the non-observation
in the catalog $j$ with noise level $\sigma_{j0}$ and signal-to-noise limit $\nu_j$, together with $N(\mathbb{D}_j)$
data entries $\mathbb{D}_{jk} = \{\mathbb{D}_{jk0}, \mathbb{D}_{jki}\}_\perp$. $\mathbb{D}_{jk0} = \{\theta_{jk}, \delta_{jk}\}$ denotes the positional distance of a
data entry to the nominal seed coordinates and its error, while $\mathbb{D}_{jki} = \{f_{jki}, \sigma_{jki}\}$ contain
$N(\mathbb{D}_{jk})$ physical parameters and their errors. Finally, we define an association $a_\ell$ as a
mapping determining one entry $\mathbb{D}_{j a_\ell(j)}$ of each catalog $j$, and denote $a_\ell\mathbb{D} = \{\mathbb{D}_{j a_\ell(j)}\}_\perp$.
Obviously, associations form a complete set of alternatives, $a = \{\{a_\ell\}\}_\wr$.

Models, Parameters and the Classification Scheme. Classification is based on a set
of mutually exclusive models $\mathcal{M} = \{\mathcal{M}_n\}_\wr$, each providing a physical description of a
known object class as a set of functions $\mu_{ni}(x_j;\omega)$ that can be compared to the data values
$f_{jki}$. $x_j$ is a physical quantity mapping a model prediction on a particular catalog (e.g.,
nominal frequency), and $\omega$ is a vector in the model parameter space $\Omega_n$ of dimension
$\dim\Omega_n$. The (prior) probability assigned to a model is understood as a marginalization
over the model parameter space, i.e., $\mathcal{P}(\mathcal{M}_n) = \int_{\Omega_n} d\omega\, p_n(\omega)$, where $p_n(\omega)$ is called the
parameter p.d.f. of $\mathcal{M}_n$.

A priori, we cannot assume that $\{\mathcal{M}_n\}_\wr$ is exhaustive. This means $\mathcal{P}(\mathcal{M}) < 1$,
which poses a problem for the proper normalization of Bayesian posterior probabilities.
We therefore introduce the classification scheme $\mathfrak{C}$ as a set of conditions that allows
us to treat $\mathcal{M}$ as an exhaustive set, and write $\mathcal{P}(\mathcal{M}|\mathfrak{C}) = 1$ and $\mathcal{M}|\mathfrak{C} = \{\{\mathcal{M}_n\}\}_\wr$. In a
more general sense, $\mathfrak{C}$ can be understood as the framework of factual information (data),
beliefs (theories and ancillary hypotheses) and decisions (e.g., how to classify objects),
which enables us to define and delimit our set of models $\mathcal{M}|\mathfrak{C}$.
Application of Bayes' Theorem



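Separating Association and Classification. The posterior probability for a candidate
object $a_\ell\mathcal{M}_n$ can be written under application of the product rule as

$\mathcal{P}(a_\ell\mathcal{M}_n|\mathbb{D}\,\mathfrak{C}) = \mathcal{P}(\mathcal{M}_n|a_\ell\mathbb{D}\,\mathfrak{C})\;\mathcal{P}(a_\ell|\mathbb{K})$ . (3)

The posterior probability of an association depends only on the set of coordinates, which
we denote by $\mathbb{K}$, so we can omit $\mathfrak{C}$ in the condition of this term. By application of
Bayes' theorem, both terms can be separately transformed as

$\mathcal{P}(\mathcal{M}_n|a_\ell\mathbb{D}\,\mathfrak{C}) = \frac{\mathcal{P}(a_\ell\mathbb{D}|\mathcal{M}_n\mathfrak{C})\;\mathcal{P}(\mathcal{M}_n|\mathfrak{C})}{\mathcal{P}(a_\ell\mathbb{D}|\mathfrak{C})}$ , (4)

$\mathcal{P}(a_\ell|\mathbb{K}) = \frac{\mathcal{P}(\mathbb{K}|a_\ell)\;\mathcal{P}(a_\ell)}{\mathcal{P}(\mathbb{K})}$ . (5)

As both $\mathcal{M}|\mathfrak{C}$ and $a$ form complete sets of alternatives we obtain the evidence terms

$\mathcal{P}(a_\ell\mathbb{D}|\mathfrak{C}) = \sum_{n=1}^{N(\mathcal{M})} \int_{\Omega_n} d\omega\; p_n(\omega|\mathfrak{C})\;\mathcal{P}(a_\ell\mathbb{D}|\mathcal{M}_n\,\omega\,\mathfrak{C})$ , (6)

$\mathcal{P}(\mathbb{K}) = \sum_\ell \mathcal{P}(\mathbb{K}|a_\ell)\;\mathcal{P}(a_\ell)$ , (7)

where we have written the model likelihoods $\mathcal{P}(a_\ell\mathbb{D}|\mathcal{M}_n\mathfrak{C})$ explicitly as integrals over
their constrained parameter p.d.f.s $p_n(\omega|\mathfrak{C})$, fulfilling $\sum_n \int_{\Omega_n} d\omega\; p_n(\omega|\mathfrak{C}) = 1$.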
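
To illustrate how the product rule and the two evidence terms combine, the following Python sketch normalizes a small table of likelihoods. It assumes that the parameter integrals have already been carried out, so that each entry of `class_terms` holds the integrated product of model likelihood and parameter p.d.f. for one candidate object. The function name, the array layout and the numbers are hypothetical and serve only as an example.

```python
import numpy as np

def candidate_posteriors(assoc_like, class_terms):
    """Posterior probability of every candidate object (association, model).

    assoc_like[l]     -- likelihood of the coordinates given association l
                         (a constant association prior is assumed, so it cancels)
    class_terms[l, n] -- parameter-integrated likelihood times prior p.d.f. of
                         model n for the data selected by association l
    """
    assoc_like = np.asarray(assoc_like, dtype=float)
    class_terms = np.asarray(class_terms, dtype=float)

    # Association posterior: normalize by the sum over all associations.
    p_assoc = assoc_like / assoc_like.sum()

    # Model posterior per association: normalize by the sum over all models.
    evidence = class_terms.sum(axis=1, keepdims=True)
    p_model = class_terms / evidence

    # Product rule: joint posterior of association and model.
    return p_assoc[:, None] * p_model

# Hypothetical example: two associations, two models.
posterior = candidate_posteriors([0.6, 0.2],
                                 [[0.03, 0.01],
                                  [0.002, 0.002]])
print(posterior)
```

Because both normalizations run over complete sets of alternatives, the entries of the returned table sum to one.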

Priors and Likelihoods of Association. Associations by themselves are just abstract
combinations of numbers. Without referring to data or physical models, we have to assign the
same value $\mathcal{P}(a_\ell) > 0$ for all $\ell$, except for some which can be excluded with certainty
(e.g., $\mathcal{P}(a_\ell) = 0$ if … $= 0$). As a constant prior $\mathcal{P}(a)$ cancels in Eqs. 5 and 7, we can set
$\mathcal{P}(a_\ell) = 1$ in all terms with $\mathcal{P}(a_\ell) \neq 0$.

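The likelihood of an association is determined by two contributions: (a) the probability
of an associated data point to be observed at an effective distance $\bar\theta_{jk}$ within an
effective accuracy $\bar\delta_{jk}$, and (b) the confusion probability $\psi_j(k)$ to have a given number $k$
of unrelated data points in a catalog of mean source density $\eta_j$ within a radius $\Delta_j$, i.e.,
the Poisson probability

$\psi_j(k) = \frac{(\pi\Delta_j^2\eta_j)^k}{k!}\,\exp(-\pi\Delta_j^2\eta_j)$ . (8)

The effective distance and accuracy take the seed position error $\delta_{j0}$ into account by defining
$\bar\theta_{jk} = \sqrt{\theta_{jk}^2 + \delta_{j0}^2}$ and $\bar\delta_{jk} = \sqrt{\delta_{jk}^2 + \delta_{j0}^2}$, thus
$\mathcal{P}(\mathbb{K}_{jk}|a_\ell) \propto (\bar\theta_{jk}/\bar\delta_{jk})\,\exp(-\bar\theta_{jk}^2/2\bar\delta_{jk}^2)$. It is
then straightforward to see$^{1,3}$ that

[Eq. 9, the expression for $\mathcal{H}(a_\ell|\mathbb{K})$, is illegible in the source; the legible fragments show a sum over the $N(\mathbb{D})$ target catalogs that includes the term $\log\psi_j(N(\mathbb{D}_j))$.] (9)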

The seed catalog does not contribute to this term as its association is logically implied. 
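
The two ingredients of the association likelihood can be sketched in a few lines of Python. The sketch assumes a flat-sky search circle, so that the expected number of unrelated sources is the source density times the circle area, and it evaluates the positional term only up to its normalization, taking the effective distance and accuracy as given inputs. Function names and numbers are hypothetical.

```python
import math

def confusion_probability(k, density, radius):
    """Poisson probability of finding k unrelated sources inside a circle.

    density -- mean source density of the catalog (sources per unit area)
    radius  -- search radius, in the length unit matching `density`
    """
    mean = density * math.pi * radius ** 2   # expected number of sources
    return mean ** k * math.exp(-mean) / math.factorial(k)

def positional_term(distance, accuracy):
    """Unnormalized positional likelihood for an effective distance and
    an effective accuracy, both given in the same angular unit."""
    ratio = distance / accuracy
    return ratio * math.exp(-0.5 * ratio ** 2)

# Hypothetical numbers: 0.5 sources per square arcmin, 1 arcmin radius.
for k in range(3):
    print(k, confusion_probability(k, density=0.5, radius=1.0))
```

The positional term peaks where the effective distance equals the effective accuracy and falls off quickly for larger separations, which is what makes distant catalog entries unlikely partners.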
Priors and Likelihoods of Classification. Priors in classification are given by the
model parameter p.d.f.s, the functional shape of which is part of the model generation
and will not be discussed here. For the normalization of the prior p.d.f.s, relative
abundances of known object classes from previous classifications can be used.

The classification likelihood is the probability of the data points and non-observations
in $a_\ell\mathbb{D}$ to match the model prediction, and we can write

[Eq. 10 is illegible in the source; the legible fragments show a term $\chi^2/2$ and a sum over $j = 1, \ldots, N(\mathbb{D})$ of $\log\mathrm{erfc}$ terms whose argument carries a factor $1/\sqrt{2}$.] (10)

The second term considers the contribution of assumed non-observations, and

$\chi^2 = \sum_{j=1}^{N(\mathbb{D})} \sum_{i=1}^{N(\mathbb{D}_{j a_\ell(j)})} \left[\frac{f_{j a_\ell(j) i} - \mu_{ni}(x_j;\omega)}{\sigma_{j a_\ell(j) i}}\right]^2$ (11)

is the usual "goodness-of-fit" measure for the data points in $a_\ell\mathbb{D}$.$^{2,3}$
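
As a minimal sketch of the two kinds of terms entering the classification likelihood, the Python functions below compute the goodness-of-fit sum for observed data points and a Gaussian probability for a non-observation. The second function is an assumption of this sketch: it uses the standard probability that a Gaussian measurement of a source with predicted value `prediction` falls below the detection threshold, and it is not claimed to reproduce the exact normalization used in Eq. 10. All names and numbers are hypothetical.

```python
import math

def chi_square(values, errors, predictions):
    """Sum of squared, error-normalized residuals over all data points."""
    return sum(((f - mu) / sigma) ** 2
               for f, sigma, mu in zip(values, errors, predictions))

def non_detection_probability(prediction, noise, snr_limit):
    """Probability that a source with predicted value `prediction` is
    measured below the threshold snr_limit * noise, for Gaussian noise.
    (Assumed form, for illustration only.)"""
    threshold = snr_limit * noise
    return 0.5 * math.erfc((prediction - threshold) / (math.sqrt(2.0) * noise))

# Hypothetical data: two measured values with errors and model predictions.
print(chi_square([1.0, 2.0], [0.5, 1.0], [1.5, 2.0]))
# A source predicted exactly at the detection threshold is missed half the time.
print(non_detection_probability(prediction=3.0, noise=1.0, snr_limit=3.0))
```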



Classifying Objects 

Definition and Properties of Confidence. Following Jaynes [7, § 4] we use the logarithm
of the odds ratio to compare our classifications and define the confidence of a
candidate object $a_\ell\mathcal{M}_n$ as

$c_{\ell n} = \log\left[\frac{\mathcal{P}(a_\ell\mathcal{M}_n|\mathbb{D}\,\mathfrak{C})}{1 - \mathcal{P}(a_\ell\mathcal{M}_n|\mathbb{D}\,\mathfrak{C})}\right]$ . (12)

The object of choice would then be the candidate object with maximum confidence,
$c_{\max}$, and we denote the corresponding indices as $\ell_{\max}$ and $n_{\max}$. Analogously, we can
define the confidence of a data association as

$\alpha_\ell = \log\left[\frac{\mathcal{P}(a_\ell|\mathbb{K})}{1 - \mathcal{P}(a_\ell|\mathbb{K})}\right]$ , (13)

and denote value and index of the maximum as $\alpha_{\max}$ and $\hat\ell_{\max}$, respectively.

Eq. 3 implies that $c_{\ell n} < \alpha_\ell$ for all $n$. Because of $\{\{a_\ell\mathcal{M}_n\}\}_\wr$, we can have $c_{\ell n} > 0$ for only
one combination $\ell n$. As this implies $\alpha_\ell > 0$, it can be taken as a condition for a unique
and consistent object choice, preferring one $a_\ell\mathcal{M}_n$ over all others. Conversely, we
cannot conclude from $\alpha_{\max} > 0$ that $c_{\max} > 0$, nor can we conclude that $\ell_{\max} = \hat\ell_{\max}$.
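
The confidence defined above is the log-odds of a posterior probability, measured in nats when the natural logarithm is used. The following Python sketch shows the conversion in both directions and the translation to decibels; the function names are hypothetical.

```python
import math

def confidence(probability):
    """Log-odds (in nats) of a posterior probability strictly between 0 and 1."""
    return math.log(probability / (1.0 - probability))

def probability_from_confidence(c):
    """Inverse of `confidence`: posterior probability for a log-odds value."""
    return 1.0 / (1.0 + math.exp(-c))

def nat_to_decibel(c):
    """Convert a confidence from nats to decibels (10 * log10 of the odds)."""
    return 10.0 * c / math.log(10.0)

print(confidence(0.5))                    # even odds give zero confidence
print(probability_from_confidence(3.0))   # probability for 3 nat
print(nat_to_decibel(1.0))                # decibel value of 1 nat
```

Since only one of a complete set of alternatives can have a probability above one half, at most one candidate object can have positive confidence.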



$^1$ Likelihoods containing contributions from both data point associations (Gaussian p.d.f.s) and
non-observations (probabilities) require conditions that normalize the relative contribution of both
kinds of terms. In $\mathcal{H}(a_\ell|\mathbb{K})$ this is done by requiring $\mathcal{P}(\mathbb{K}_{jk}|a_\ell) \to 1$ for matching coordinates measured
with arbitrary precision, $\theta_{jk} < \delta_{jk} < \delta_{j0} \to 0$.

$^2$ Here we require that for the fiducial case $\mu_{ni}(x_j;\omega) = \nu_j\sigma_{j0}$, the probability of a data point $\{\nu_j\sigma_{j0}, \sigma_{j0}\}$
to be consistent with the model prediction is equal to the probability of a non-observation.

$^3$ We use Gaussian p.d.f.s as we assumed in our data structure that only one error parameter is given for
each position or quantity. If more detailed error information is available, the definition of the
corresponding likelihoods has to be adapted.
Quality Rating. Based on the discussion above, we can define for each object a quality
rating, defining potential actions to be taken for catalog validation and verification.
The most basic scheme would contain four ratings as follows.

Rating A selects clear cases, and is given for $c_{\max} > c_{\lim} > 3$ nat, corresponding to a
rejection probability of the best alternative of > 95% (1 nat = 4.34 dB). Little or no human
inspection is necessary in these cases, and results of rejected object associations can be
deleted. Rating B would be applied for $c_{\max} > 0$, and indicates likely cases, while rating
C would be applied to potentially ambiguous cases with $\alpha_{\max} > 0$ while $c_{\max} < 0$. Both
require human inspection at different levels, and all results with $c_{\ell n} \sim c_{\max}$ should be kept
for validation. Finally, a rating D ($\alpha_{\max} < 0$) identifies objects which would normally be
rejected in catalog generation, but which may still be interesting to look at for research
purposes. Of course, this scheme may be adapted to the needs of reliability, and it may
make sense to split up rating A using a sequence of increasing $c_{\lim}$.
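
The four ratings translate directly into a small decision function. The Python sketch below is one possible reading of the scheme; the function name is hypothetical, and the treatment of the boundary cases (a confidence of exactly zero or exactly the limit) is an assumption, since the text only specifies strict inequalities.

```python
def quality_rating(c_max, a_max, c_lim=3.0):
    """Assign rating A-D from the maximum object confidence `c_max` and the
    maximum association confidence `a_max` (both in nats).

    Boundary values are assigned to the lower rating (an assumption)."""
    if c_max > c_lim:
        return "A"   # clear case, little or no inspection needed
    if c_max > 0.0:
        return "B"   # likely case, keep alternatives for validation
    if a_max > 0.0:
        return "C"   # association is clear, classification is ambiguous
    return "D"       # normally rejected, possibly interesting for research

# Hypothetical examples.
for c, a in [(4.0, 5.0), (1.0, 2.0), (-1.0, 2.0), (-3.0, -1.0)]:
    print(c, a, quality_rating(c, a))
```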



Odd Objects

Our plan is to drop a lot of odd objects
onto your country from the air.
And some of these objects will be useful.
And some of them will just be odd.

Laurie Anderson

Counter-evidence. We are left with a problem: Assume there is an object which does
not fit into any of our model classes. How would it appear in our classification?

Obviously, for such objects all integrals in the sum of Eq. 6 would become very
small, so $\mathcal{P}(a_\ell\mathbb{D}|\mathfrak{C})$ would become very small, even if the data association has a high
confidence. We therefore introduce the counter-evidence for an associated object to fit
into the classification scheme as

$\kappa_\ell = -\log\mathcal{P}(a_\ell\mathbb{D}|\mathfrak{C}) = \mathcal{H}(\mathfrak{C}|a_\ell\mathbb{D})$ (14)

and $\kappa = \min_\ell(\kappa_\ell)$. Thus, $\kappa$ can be seen as the information Hamiltonian of the classification
scheme, taken for the association for which it becomes minimal. Large values of $\kappa$ are
an indication of classification exceptions. Following the US performance artist Laurie
Anderson$^4$ we call such cases odd objects: While exceptions are usually expected to be
results of instrumental errors or defects in the target catalogs ("just odd"), they could
also indicate the discovery of a new, unexpected object type ("useful").

Introducing an exception class. That our method mingles candidates for rejection
with candidates for discovery is a defect caused by forcing the condition $\mathcal{P}(\mathcal{M}|\mathfrak{C}) = 1$
onto an, in principle arbitrary, classification scheme $\mathfrak{C}$. To overcome this problem, we
introduce an odd-object class $\mathcal{M}_0$ defined by a single parameter

$\xi = -\log\left(\mathcal{P}(a_\ell\mathbb{D}|\mathcal{M}_0)\,\mathcal{P}(\mathcal{M}_0)\right)$ (15)

for all $a_\ell\mathbb{D}$. As $\mathcal{M}_0$ is the logical complement of the set $\mathcal{M}$, we have $\{\{\mathcal{M}_0, \mathcal{M}_n\}\}_\wr$ without
conditions, and we obtain the total evidence

$\mathcal{P}(a_\ell\mathbb{D}) = \mathcal{P}(a_\ell\mathbb{D}|\mathfrak{C}) + e^{-\xi}$ . (16)

$^4$ Laurie Anderson, United States Live, Warner Bros. (1983)
This implies $c_{\ell 0} = \kappa_\ell - \xi$, and we define the confidence for an object to be odd as

$c_0 = \kappa - \xi$ . (17)

Moreover, objects of classes $\mathcal{M}_n$ need to fulfill $\mathcal{P}(a_\ell\mathbb{D}|\mathcal{M}_n)\,\mathcal{P}(\mathcal{M}_n) > e^{-\xi}$ for some $\ell n$ in
order to receive a rating B or above, while there was no such limit in $\mathfrak{C}$. To prevent
odd objects from being accidentally considered "clear cases" if $\mathcal{P}(a_\ell\mathbb{D}|\mathcal{M}_n)\,\mathcal{P}(\mathcal{M}_n) \ll e^{-\xi}$ for all
$\ell n$, we introduce a sub-rating $A_0 \subset A$ for $c_0 > c_{\lim}$, which requires human inspection.
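
The counter-evidence and the confidence for an object to be odd can be computed from the evidence values of all associations in a few lines. The Python sketch below assumes that these evidences are already available as plain numbers; the function names, the threshold value and the example numbers are hypothetical.

```python
import math

def counter_evidence(evidences):
    """Minimum over all associations of -log(evidence), in nats."""
    return min(-math.log(e) for e in evidences)

def odd_confidence(evidences, xi):
    """Confidence for an object to be odd: counter-evidence minus the
    odd-object threshold xi."""
    return counter_evidence(evidences) - xi

def is_odd_clear_case(evidences, xi, c_lim=3.0):
    """True if the object would receive the sub-rating for clearly odd
    objects, i.e. if the odd confidence exceeds the confidence limit."""
    return odd_confidence(evidences, xi) > c_lim

# Hypothetical evidences of two associations for two different objects.
print(odd_confidence([1e-3, 1e-5], xi=3.0))   # poorly explained by all models
print(odd_confidence([0.2, 0.05], xi=3.0))    # well explained by the scheme
```

Taking the minimum over associations means that an object is flagged as odd only if no association at all is explained well by the classification scheme.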

DISCUSSION AND PHILOSOPHICAL EPILOGUE 
Benefits of Bayesian Classification 

Models and Priors: Experience vs. Bias. In Bayesian classification we use models 
and priors, which are usually suspected in catalog generation to introduce bias. Shouldn't 
we use only the information contained in the given data set in order to be objective? 
Our Bayesian answer is: No, we shouldn't, and in fact, we never do. In general, it is 
the advantage of Bayesian methods to clearly state our priors, while orthodox methods 
often hide the prior assumptions used. For the special case of catalog generation, this 
means that we always have additional data available, usually in a complex and incoherent 
form, and also widely accepted models describing the nature of our potential objects, 
and these data and models are used by "experienced astronomers" in the process called 
catalog validation. All we do by introducing models and priors is to automate part of this 
experience, i.e., provide a condensed description of our prior knowledge and beliefs to 
the classification procedure. Our quality rating ensures that this affects only the trivial,
routine tasks of validation, and prevents potentially interesting alternatives to the
best assignments from being prematurely dropped (e.g., cases with $\ell_{\max} \neq \hat\ell_{\max}$).

Beyond Best Fits: Robustness and Model Complexity. Our method exhibits a fundamental
aspect of Bayesian classification: Model parameters are not optimized as in
"best-fit" approaches, but marginalized in Eqs. 4 and 6. We emphasize that this is implied
by plausibility logic: The question is not which model can produce an optimal
fit to the data for some parameter choice, but which model explains the data in the most
natural way, given prior expectations for its parameters.

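To discuss this in more detail, let us consider one parameter dimension $\omega_i$ of the
model parameter space and assume that $p_n(\omega) \approx p_n(\omega_{\setminus i})/|\Omega_{ni}|$ for $\omega_i \in \Omega_{ni}$,
and $p_n(\omega) = 0$ otherwise. Moreover, we assume that for a reference value of $\omega_i$ we achieve
$\mathcal{H}^\circ_{n\setminus i} = \mathcal{H}(\mathcal{M}_{n\setminus i}|a_\ell\mathbb{D})$, integrated over all parameter dimensions except $\omega_i$. Varying $\omega_i$ may
decrease $\mathcal{H}(\mathcal{M}_{n\setminus i}|a_\ell\mathbb{D})$ to a value $\approx \mathcal{H}^+_{ni} < \mathcal{H}^\circ_{n\setminus i}$ (i.e., increase the likelihood) in some
regime $\omega_i \in \Omega^+_{ni}$, while it decreases the likelihood ($\mathcal{H}(\mathcal{M}_{n\setminus i}|a_\ell\mathbb{D}) \approx \mathcal{H}^-_{ni} > \mathcal{H}^\circ_{n\setminus i}$) in some
other regime $\omega_i \in \Omega^-_{ni}$. Everywhere else we assume $\mathcal{H}(\mathcal{M}_{n\setminus i}|a_\ell\mathbb{D}) \approx \mathcal{H}^\circ_{n\setminus i}$. Defining

$\Lambda^\pm_{ni} = \pm\left(e^{\mathcal{H}^\circ_{n\setminus i} - \mathcal{H}^\pm_{ni}} - 1\right)$ and $W^\pm_{ni} = \frac{|\Omega^\pm_{ni}|}{|\Omega_{ni}|}$ , (18)

we immediately obtain for the change in confidence caused by parameter $\omega_i$

$[\Delta c_{\ell n}]_{\omega_i} \approx \log\left(1 + \Lambda^+_{ni} W^+_{ni} - \Lambda^-_{ni} W^-_{ni}\right)$ . (19)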
A significant increase of the model confidence is only obtained if $\Lambda^+_{ni} W^+_{ni} - \Lambda^-_{ni} W^-_{ni} \gtrsim 1$,
i.e., if a significant net improvement of the fit quality averaged over the "prior mass" $\Omega_{ni}$
of the parameter is achieved. We shall call parameters with this property robust, while
parameters with $\Lambda^+_{ni} W^+_{ni} < \Lambda^-_{ni} W^-_{ni}$ shall be called fragile.

The factors $W^\pm_{ni}$ are equivalent to the Ockham factors defined by Jaynes [7, § 20],
referring to the principle of simplicity known as Ockham's razor. However, Eq. 19 shows
that Bayesian logic does not lead to a flat penalization of model complexity; rather, a
parameter which does not affect the fit quality ($W^+_{ni} = W^-_{ni} = 0$) does not affect the model
confidence. It therefore seems more appropriate to say that Bayesian logic penalizes fine
tuning, i.e., the introduction of fragile parameters with few prior constraints for the
mere purpose of improving the "best fit" for some particular choice of parameter values.$^5$
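
The distinction between robust and fragile parameters can be made concrete with a short numerical sketch of Eq. 19. The Python function below takes the improvement and the deterioration of the fit (as differences of the information Hamiltonian, in nats) together with the two prior-mass fractions; its name, its argument names and the example numbers are hypothetical.

```python
import math

def confidence_change(fit_gain, fit_loss, w_plus, w_minus):
    """Approximate change in model confidence from one extra parameter.

    fit_gain -- decrease of the information Hamiltonian in the regime where
                the parameter improves the fit (>= 0, in nats)
    fit_loss -- increase of the information Hamiltonian in the regime where
                the parameter worsens the fit (>= 0, in nats)
    w_plus, w_minus -- fractions of the prior range covered by each regime
    """
    lam_plus = math.exp(fit_gain) - 1.0     # relative likelihood gain
    lam_minus = 1.0 - math.exp(-fit_loss)   # relative likelihood loss
    return math.log(1.0 + lam_plus * w_plus - lam_minus * w_minus)

# A parameter that improves the fit over a fair share of its prior range.
print(confidence_change(fit_gain=2.0, fit_loss=2.0, w_plus=0.1, w_minus=0.5))
# A fine-tuned parameter: small gain in a tiny range, large loss elsewhere.
print(confidence_change(fit_gain=1.0, fit_loss=5.0, w_plus=0.05, w_minus=0.6))
```

In the limit where the likelihood vanishes outside the favorable regime (`fit_loss` infinite and `w_minus = 1 - w_plus`), the function returns the fit gain plus the logarithm of `w_plus`, which is the familiar Ockham penalty.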

Bayesian Learning: Updating the Classification Scheme. Classification is naturally
applied to a large number of objects $\mathbb{O} = \{[a\mathbb{D}\mathcal{M}]_s\}_\perp$, which allows us to use posterior
number distributions to iteratively update all prior assumptions which we have entered.
In particular, total model priors $\mathcal{P}(\mathcal{M}_n)$ can be updated as

$\mathcal{P}(\mathcal{M}_n) \to \frac{N(\mathbb{O}|_{n,A})}{N(\mathbb{O}|_A)}$ , (20)

where $\mathbb{O}|_A$ [$\mathbb{O}|_{n,A}$] denotes the set of all A rated objects [in model class $n$]. In the same
way, updates can be applied to the shape of prior p.d.f.s of the models, if these are
determined by empirical parameters.
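
A minimal sketch of such a prior update, assuming that the updated prior of a model class is simply its fraction among all A rated objects, could look as follows in Python. The input format (pairs of model index and rating) and the function name are hypothetical; index 0 is used here for an exception class.

```python
from collections import Counter

def updated_model_priors(classified_objects):
    """Estimate model priors as class fractions among A rated objects.

    classified_objects -- iterable of (model_index, rating) pairs
    Returns a dict mapping model index to its updated prior; the dict is
    empty if no object received rating A.
    """
    a_rated = [model for model, rating in classified_objects if rating == "A"]
    if not a_rated:
        return {}
    counts = Counter(a_rated)
    return {model: count / len(a_rated) for model, count in counts.items()}

# Hypothetical classification result for five objects.
objects = [(1, "A"), (1, "A"), (2, "A"), (2, "B"), (0, "A")]
print(updated_model_priors(objects))
```

Restricting the count to A rated objects keeps ambiguous classifications from feeding back into the priors.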

The most important parameter for posterior updates is the odd-object threshold
$\xi$. If we consider $\mathcal{P}(\mathcal{M}_0)$ determined by Eq. 20 as a function of $\xi$ and call it $R_0(\xi)$,
we note that $R_0(0) = 1$ and $R_0(\xi) \to 0$ for $\xi \to \infty$. If classification exceptions hide a
class of undiscovered objects with particular properties, we would expect that they are
grouped around some large value of $\xi$, while all objects fitting into the classification
scheme have small values of $\xi$. In between, we expect a range where $R_0(\xi)$ remains
approximately constant, and a good choice of $\xi$ for separating the two populations is
then found by maximizing

[Eq. 21: an objective function of $\xi$ of the form $\xi + \ldots\,\log R_0(\xi)$; the factor in front of the logarithm is illegible in the source.] (21)

within the range of $\xi$ where $R_0(\xi) > 0$. Once $\xi$ is found, we can update all model priors
by Eq. 20.

In principle, every update is a redefinition of the classification scheme, and the goal
of our iterative process is to find a converging chain of updates $\mathfrak{C}' \succ \mathfrak{C}'' \succ \ldots$, until a
self-consistent result is obtained. If this does not succeed, our conclusion might be that
the classification task is ill-defined, and we may replace our classification scheme $\mathfrak{C}$
with an entirely different $\mathfrak{C}^*$, containing other models to define object classes.



$^5$ In his discussion of this topic on pp. 605-607 of his book [7], Jaynes implicitly assumes that the likelihood
is significantly different from zero only within $\Omega^+_{ni}$. If a moderately good match $\mathcal{H}^\circ_{n\setminus i}$ has been achieved
without the parameter $\omega_i$, this is equivalent to setting $\Lambda^-_{ni} = 1$ and $W^-_{ni} = 1 - W^+_{ni}$ in Eq. 19, yielding
$[\Delta c_{\ell n}]_{\omega_i} \approx \mathcal{H}^\circ_{n\setminus i} - \mathcal{H}^+_{ni} + \log W^+_{ni}$. Now the Ockham factor indeed penalizes the model complexity, as it
requires $\mathcal{H}^\circ_{n\setminus i} - \mathcal{H}^+_{ni} \gg -\log W^+_{ni}$ for a significant improvement of confidence (note that $\log W^+_{ni} < 0$).
Classification and Inference



Interpretation Schemes and Anomalies. With these considerations we make the link
from Bayesian classification to Bayesian inference. There, we confront a set of models
or theories, which we call the interpretation scheme $\mathfrak{I}$, with a series of data sets $\mathbb{D}_s$,
which we now call tests of the interpretation scheme, expecting that subsequent tests
will lead to a more and more reliable estimation of the free parameters in our model
space. Occasionally, however, results of experiments will not fit into the picture at all
($\kappa \gg 1$), and we then call them anomalies. Normally, we will cope with anomalies
by successively extending the parameter spaces of models ($\mathfrak{I} \succ \mathfrak{I}' \succ \mathfrak{I}'' \succ \ldots$), but if
anomalies become rampant, we will have to doubt the validity of our interpretation
scheme as a whole. This may lead us to replace it with a new scheme involving entirely
new theories ($\mathfrak{I} \to \mathfrak{I}^*$), requiring a reinterpretation of all data sets observed so far.

The Course of Science in a Bayesian View. The gentle reader may have noticed
that our interpretation scheme is what Thomas Kuhn has called a paradigm [8]. In
Bayesian language, it is that part of our "web of beliefs" which is kept unchanged in
technical applications, slowly modified in the normal course of science, but questioned
and eventually overthrown when confronted with overwhelming anomalies. We
have identified the counter-evidence as a measure to monitor such developments.

We may write $\mathfrak{I}(t)$ for an interpretation scheme continuously modified over time,
and define $R(t)$ as its average counter-evidence. $\mathfrak{I}(t)$ can then be identified with Imre
Lakatos' concept of a research programme [9], and the sign of $dR/dt$ would indicate
whether it is "progressing" ($dR/dt < 0$) or "degenerating" ($dR/dt > 0$). Degeneration of
a research programme, or the decline of a paradigm, is not only caused by
experimental anomalies, but also by fragile parameters introduced to cope with them. At
the end of the road, we may enter what Kuhn calls a scientific revolution, the
incommensurable paradigm shift $\mathfrak{I} \to \mathfrak{I}^*$, by which all known data obtain a new meaning
[8]. A further exploration of these topics would be beyond the scope of this paper, but it is
intriguing to note how Bayesian methods allow a quantitative understanding of concepts
in the philosophy of science which are otherwise considered irrational.